Abstract

Dependent nonparametric processes extend distributions over mea- 
sures, such as the Dirichlet process and the beta process, to give distri- 
butions over collections of measures, typically indexed by values in some 
covariate space. Such models are appropriate priors when exchangeability 
assumptions do not hold, and instead we want our model to vary fluidly 
with some set of covariates. Since the concept of dependent nonparametric
processes was formalized by MacEachern [1], there have been a
number of models proposed and used in the statistics and machine learn- 
ing literatures. Many of these models exhibit underlying similarities, an 
understanding of which, we hope, will help in selecting an appropriate 
prior, developing new models, and leveraging inference techniques. 

1 Introduction 

There has recently been a spate of papers in the statistics and machine learning 
literature developing dependent stochastic processes and using them as priors 
in Bayesian nonparametric models. In this paper, we aim to provide a repre- 
sentative snapshot of the currently available models, to elucidate links between 
these models, and to provide an orienting view of the modern constructions of 
these processes. 

Traditional nonparametric priors such as the Dirichlet process [DP], Chinese
restaurant process [CRP], Pitman-Yor process [4] and the Indian buffet
process [IBP, 5] assume that our observations are exchangeable. Under the
assumption of exchangeability, the order of the data points does not change the
probability distribution.

Exchangeability is not a valid assumption for all data. For example, in 
time series and spatial data, we often see correlations between observations at 
proximal times and locations. The fields of time series analysis [6] and spatial
statistics [7] exist to model this dependence. In fact, many data sets contain
covariate information, that is, variables we do not wish to model but only condition
on, that we may wish to leverage to improve model performance.

The use of non-exchangeable priors in a Bayesian nonparametric context is 
relatively new. While not the first model to address non-exchangeability in a 
nonparametric framework (this honor arguably goes to [8]), the seminal work in
this area is a technical report by MacEachern [1]. In that paper, MacEachern
formally introduces the idea of dependent nonparametric processes, and proposes
a set of general desiderata, described in Section 3, that paved the way for
subsequent work. 

Loosely, dependent nonparametric processes extend existing nonparametric
priors over measures, partitions, sequences, etc. to obtain priors over collections 
of such structures. Typically, it is assumed that the members of these collec- 
tions are associated with values in some metric covariate space, such as time or 
geographical location, and that locations that are close in covariate space tend 
to generate similar structures. 

Since MacEachern's technical report, there has been an explosion in the 
number of such dependent nonparametric processes. While, at first glance, the 
range of models can seem overwhelming, the majority of existing models fall 
under one (or sometimes more) of a relatively small number of classes: 

• Hierarchical models for grouped data with categorical covariates. 

• Collections of random measures where the atom locations vary across co- 
variate space. 

• Collections of random measures where dependency is introduced via a 
stick-breaking representation. 

• Collections of locally exchangeable sequences generated by inducing de- 
pendency in the family of conditional distributions in a de Finetti repre- 
sentation. 

• Collections of locally exchangeable sequences obtained by modifying the
predictive distribution of an exchangeable sequence. 

• Collections of random measures obtained by exploiting properties of the 
Poisson process. 

• Collections of random measures obtained by superpositions and convex 
combinations of random measures. 

We find there are many similarities between the models in each class. There 
are also differences, particularly in the form of covariate space assumed: Some 
models are appropriate only for time [9l [10l Qj] , while others consider general 
underlying covariates [TH [T31 HH [Hj . Different authors have applied varying 
inference techniques to highly related models, and have adapted their models 
to numerous applications.
By pointing out similarities between existing models, we hope to aid understanding
of the current range of models, to assist in development of new models,
and to elucidate areas where inference techniques and representations can easily 
be transferred from one model to another. 

This is not the first survey of dependent nonparametric processes: in par- 
ticular, a well-written survey of some of the earlier constructions can be found 
in [TB|. However, the subfield is growing rapidly, and Dunson's survey does 
not incorporate much recent work. While early research in this
area focused on dependent Dirichlet processes, the machine learning community
has recently begun exploring alternative dependent stochastic
processes, such as dependent beta processes [T!?l RHI IT51 117] . 

In addition to describing the range of dependent nonparametric priors in the 
current literature, we explore some of the myriad applications of these models
in a later section.

The reader will realize, while reading this paper, that not all of the models 
presented fit into MacEachern's original specification. For example, many of the 
kernel-based methods do not exhibit easily recognizable marginal distributions, 
and some of the models do not satisfy marginal invariance. In our conclusion, 
we discuss how the original desiderata of [1] pertain to the current body of
work, and consider the challenges currently facing the subfield. 

2 Background 

The dependent nonparametric models discussed in this survey are based on a 
relatively small number of stationary Bayesian nonparametric models, which we 
review in this section. We focus on nonparametric priors on the space of σ-finite
measures. In particular, we consider two classes of random measure: random 
probability measures, and completely random measures. We also discuss ex- 
changeable sequences that are related to these random measures. 

2.1 Notation 

We introduce some notation here for convenience. We denote unnormalized
random measures as $G$ and random probability measures as $P$. We use $X$ for
an arbitrary covariate space with elements $x \in X$. When the covariate
is time, $X = \mathbb{R}_+$, we use $t$ for an observed covariate. We denote a random
measure evaluated at a covariate value $x \in X$ as $G^{(x)}$, and for an indexed set of
covariates, $\{x_i\}$, as $G_i = G^{(x_i)}$. The notation $G_i$ is used extensively when we
observe uniformly spaced observations in time.

2.2 Completely random measures 

A completely random measure [CRM] is a distribution over measures on
some measurable space $(\Theta, \mathcal{F}_\Theta)$, such that the masses $G(A_1), G(A_2), \ldots$ assigned
to disjoint subsets $A_1, A_2, \ldots \in \mathcal{F}_\Theta$ by a draw $G$ from the CRM are independent
random variables.

A CRM on some space $\Theta$ is characterized by a positive Lévy measure
$\nu(d\theta, d\pi)$ on the product space $\Theta \times \mathbb{R}_+$, with associated product σ-algebra
$\mathcal{F}_\Theta \otimes \mathcal{F}_{\mathbb{R}_+}$. Completely random measures have a useful representation in terms
of Poisson processes on this product space. Let $\Pi = \{(\theta_k, \pi_k) \in \Theta \times \mathbb{R}_+\}_{k=1}^{\infty}$
be a Poisson process on $\Theta \times \mathbb{R}_+$ with rate measure^1 given by the Lévy measure
$\nu(d\theta, d\pi)$. We denote this as $\Pi \sim \mathrm{PP}(\nu)$. Then the completely random measure
with Lévy measure $\nu(d\theta, d\pi)$ can be represented as

$$G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k} \qquad (1)$$

(see [H] for details). We denote a draw from a CRM with Lévy measure $\nu$ as
$G \sim \mathrm{CRM}(\nu)$. Due to the correspondence between CRMs and Poisson processes,
we can simulate a CRM by simulating an inhomogeneous Poisson process with
appropriate rate measure.

An alternative representation of CRMs, often referred to as the Ferguson-Klass
representation and more generally applicable to pure-jump Lévy processes,
is given by transformation of a unit-rate Poisson process on $\mathbb{R}_+$ [19]. Specifically,
let $u_1, u_2, \ldots$ be the arrival times of a unit-rate Poisson process on $\mathbb{R}_+$, and
denote the tail $T(s)$ of the Lévy measure $\nu$ as $T(s) = \nu(\Theta, (s, \infty))$. Then the
strictly decreasing atom sizes of a CRM are obtained by transforming the strictly
increasing arrival times $u_1 < u_2 < \ldots$ according to $\pi_k = T^{-1}(u_k)$.
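As a rough illustration of the Ferguson-Klass construction, the following Python sketch generates the largest atom sizes of a σ-stable subordinator. This process is chosen here only because its tail, $T(s) = s^{-\sigma} / \Gamma(1 - \sigma)$, has a closed-form inverse. The function names, the parameter `sigma`, and the assumption that the base measure has unit total mass are illustrative choices and are not taken from the text.

```python
import math
import random


def stable_tail_inverse(u, sigma):
    """Invert the tail T(s) = s**(-sigma) / Gamma(1 - sigma) at level u."""
    return (u * math.gamma(1.0 - sigma)) ** (-1.0 / sigma)


def ferguson_klass_atoms(num_atoms, sigma=0.5, seed=0):
    """Return the num_atoms largest atom sizes, in decreasing order."""
    rng = random.Random(seed)
    arrival = 0.0
    sizes = []
    for _ in range(num_atoms):
        # Gaps between arrivals of a unit-rate Poisson process are Exp(1).
        arrival += rng.expovariate(1.0)
        # Increasing arrival times map to decreasing atom sizes.
        sizes.append(stable_tail_inverse(arrival, sigma))
    return sizes


if __name__ == "__main__":
    print(ferguson_klass_atoms(5))
```

Atom locations are not generated here; for a homogeneous Lévy measure they could be drawn independently from the normalized base measure.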

If the Lévy measure is homogeneous, i.e. $\nu(d\theta, d\pi) = \nu_\theta(d\theta) \nu_\pi(d\pi)$, the
atom locations are distributed according to $\nu_\theta(d\theta)$. If the Lévy measure is
inhomogeneous, we can obtain the conditional distribution of an atom's location
given its size by solving the transport problem described in [20].

The class of completely random measures includes important distributions
such as the beta process, gamma process, Poisson process, and the stable sub- 
ordinator. Such distributions are often used as priors on the hazard function in 
survival analysis. See [21] for a recent review of completely random measures
and their applications. 

2.3 Normalized random measures 

Since any σ-finite measure on $\Theta$ implies a probability measure on $\Theta$, we can
construct a distribution over probability measures by normalizing the output of
a completely random measure. The resulting class of distributions is referred
to as normalized random measures [NRM, 22]. Specifically, for $G = \sum_k \pi_k \delta_{\theta_k} \sim \mathrm{CRM}(\nu)$,
define the probability measure

$$P = \sum_{k=1}^{\infty} \frac{\pi_k}{\sum_{j=1}^{\infty} \pi_j} \delta_{\theta_k} \sim \mathrm{NRM}(\nu). \qquad (2)$$

1 The rate measure is $E[\Pi(A)]$, that is, the expected number of points of the Poisson process
that fall in a measurable set $A$.

The most commonly used exemplar is the Dirichlet process, which can be
obtained as a normalized gamma process. Distributions over probability measures
are of great importance in Bayesian statistics and machine learning, and
normalized random measures have been used in many applications including
natural language processing, image segmentation and speaker diarization [23].
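The relationship between the gamma process and the Dirichlet process can be checked on a finite partition. The sketch below assumes a partition $A_1, \ldots, A_m$ of $\Theta$ with base-measure masses `base_masses` and a concentration parameter `concentration`; both names are illustrative. The gamma process assigns independent Gamma-distributed masses to the disjoint sets, and normalizing them yields a Dirichlet-distributed vector, which is the finite-dimensional marginal of a Dirichlet process.

```python
import random


def normalized_gamma_masses(base_masses, concentration, seed=0):
    """Normalize independent gamma masses on a finite partition.

    base_masses[i] is the base-measure mass of the i-th set (assumed > 0).
    The result is a draw from Dirichlet(concentration * base_masses).
    """
    rng = random.Random(seed)
    # Completely random: masses on disjoint sets are independent.
    gammas = [rng.gammavariate(concentration * h, 1.0) for h in base_masses]
    total = sum(gammas)
    return [g / total for g in gammas]


if __name__ == "__main__":
    print(normalized_gamma_masses([0.2, 0.3, 0.5], concentration=5.0))
```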



2.4 Exchangeable sequences 

Often, we do not work directly with the random measures described in Sections
2.2 and 2.3. Instead, we work with closely related exchangeable nonparametric
sequences. Recall that an infinitely exchangeable sequence is one whose
probability is invariant under finite permutations $\tau_n$ of the first $n$ elements, for
all $n \in \mathbb{N}$ [24]. De Finetti's theorem tells us that any infinitely exchangeable
sequence $X_1, X_2, \ldots$ can be written as a mixture of i.i.d. samples

$$\int_\Theta \prod_{i=1}^{n} Q_\theta(X_i) P(d\theta) \qquad (3)$$

where $\{Q_\theta, \theta \in \Theta\}$ is a family of conditional distributions and $P$ is a distribution
over $\Theta$ called the de Finetti mixing measure.

We can obtain a number of interesting exchangeable sequences by letting
$\Theta$ be a space of measures and $P$ be a distribution over such measures. For
example, if $P$ is a Dirichlet process, and $Q_\theta$ is the discrete probability distribution
described by the probability measure $\theta$, then we obtain a distribution over
exchangeable partitions known as the Chinese restaurant process [CRP, 24].

Similarly, we can use a completely random measure as the de Finetti mixing 
measure, to obtain a sequence of exchangeable vectors (or equivalently, a matrix 
with exchangeable rows). For example, if we combine a beta process prior with 
a Bernoulli process likelihood, and integrate out the beta process-distributed 
random measure, we obtain a distribution over exchangeable binary matrices 
referred to as the Indian buffet process [IBP, 25]. The IBP has been used
to select subsets of an infinite set of latent variables in nonparametric versions 
of latent variable models such as factor analysis and independent component 
analysis [25]. Other distributions over exchangeable matrices have been defined 
using the beta process [27J and the gamma process [IT] [28] as mixing measures. 

Conjugacy between the de Finetti mixing measure and the family of conditional
distributions often means the predictive distribution $p(X_{n+1} \mid X_1, \ldots, X_n)$
can be obtained analytically. For example, the predictive distribution for the
Chinese restaurant process is given by

$$p(X_{n+1} = k \mid X_1, \ldots, X_n) = \begin{cases} \dfrac{m_k}{n + \gamma} & \text{if } m_k > 0 \\ \dfrac{\gamma}{n + \gamma} & \text{otherwise} \end{cases} \qquad (4)$$



where $m_k$ is the number of observations in $X_1, \ldots, X_n$ that have been assigned
to cluster $k$, and $\gamma$ is the concentration parameter of the underlying Dirichlet
process. The distribution represented by Equation 4 can be understood in terms
of the following analogy. Consider a sequence of customers entering a restaurant
with an infinite number of tables, each serving a single dish. Let $z_n$ denote the
table that customer $n$ sits at, and let $K_+$ denote the number of tables occupied
by at least one customer. Customer $n + 1$ enters the restaurant and sits at
table $k$, where dish $\theta_k$ is being served, with probability proportional to $m_k$, the
number of previous customers to have sat at table $k$. This is indicated by setting
$z_{n+1} = k$. Alternatively, the customer may sit at a new table, $z_{n+1} = K_+ + 1$,
with probability proportional to $\gamma$, and choose a new dish, $\theta \sim H_0$, for some
measure $H_0$.
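The seating scheme described above can be simulated directly. The following Python sketch draws table assignments only; dishes are omitted, and the function name `sample_crp` and its arguments are illustrative rather than taken from the text.

```python
import random


def sample_crp(num_customers, gamma, seed=0):
    """Seat customers one at a time according to the CRP predictive rule.

    Returns (assignments, counts): the table index of each customer and
    the number of customers at each occupied table.
    """
    rng = random.Random(seed)
    assignments = []
    counts = []
    for _ in range(num_customers):
        # Existing table k has weight m_k; a new table has weight gamma.
        weights = counts + [gamma]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(counts):
            counts.append(1)  # open a new table
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments, counts


if __name__ == "__main__":
    z, m = sample_crp(20, gamma=1.0)
    print(z)
    print(m)
```

Because `random.choices` normalizes the weights internally, the normalizing constant $n + \gamma$ never needs to be computed explicitly.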

3 Dependent nonparametric processes 

If we define a nonparametric process as a distribution over measures with count- 
ably infinite support, then a dependent nonparametric process is a distribution 
over collections of such measures. Joint distributions over discrete measures 
have been proposed since the early days of Bayesian nonparametrics (for example
[5]), but a formal framework was first proposed in a technical report of
MacEachern [1]. MacEachern proposed the following criteria for a distribution
over collections $\{G^{(x)} : x \in X\}$ of measures:

1. The support of the prior on $\{G^{(x)} : x \in A\}$ for any distinct set $A$ should
be large.

2. The posterior distribution should be reasonably easy to obtain, either
analytically or computationally.

3. The marginal distribution of $G^{(x)}$ should follow a familiar distribution for
each $x \in X$.

4. If a sequence of observations converges to some $x_0$, then the posterior
distribution of $G^{(x)}$ will also converge to some $G^{(x_0)}$.

This specification is vague about the form of the dependence and of the space
$X$. It is typically assumed that $(X, d)$ is some metric space, and that as $x' \to x$,
$G^{(x')} \to G^{(x)}$. There are, however, distributions in the class of dependent
nonparametric processes that do not require a metric space. Models such as the 
hierarchical Dirichlet process j^Hj create distributions over exchangeable, but 
not independent, measures. Other models that fall under this category include 
the hierarchical beta process [25 and a number of partially exchangeable models 

ESG2J]. 

In this survey, we focus on distributions over collections of measures indexed
by locations in some metric space. Examples of typically used spaces
include $\mathbb{R}_+$ (for example, to represent continuous time), $\mathbb{N}$ (for example, to represent
discrete time), or $\mathbb{R}^d$ (for example, to represent geographic location). We
will, however, spend a little time discussing the two most commonly used
non-covariate-indexed dependent nonparametric processes, the hierarchical Dirichlet
process and the hierarchical beta process, since they are frequently used as part
of other dependent nonparametric processes.

We will also consider distributions over collections of partitions and vectors 
that can be seen as dependent extensions of models such as the CRP and the 
IBP. It is clear, for example, that any dependent Dirichlet process (DDP) can 
be used to construct a dependent CRP, by constructing a distribution over 
partitions using the marginal measure at each covariate location. However, we 
will see that we can also construct dependent Chinese restaurant processes and 
Indian buffet processes by directly manipulating the corresponding stationary 
model, and that such distributions do not necessarily correspond to a simple 
random measure interpretation. 

4 Distributions over exchangeable random measures

While most of the processes we will present in this paper consider collections of 
measures indexed by locations in some covariate space endowed with a notion of 
similarity, the definition provided by MacEachern does not require such a space. 
In fact, one of the most commonly used dependent nonparametric processes is 
a distribution over an exchangeable collection of random measures. 

The hierarchical Dirichlet process [HDP, 25] is a distribution over multiple
correlated discrete probability measures with a shared support (that is, the
locations of the atoms are shared across the random measures). Each probability
measure is distributed according to a Dirichlet process, with a shared concentration
parameter and base measure. In order to ensure sharing of atoms, this
base measure must be discrete; this is achieved by putting a Dirichlet process
prior on the base measure, resulting in the following hierarchical model:

$$G_0 \sim \mathrm{DP}(\gamma_0, H)$$

$$G_j \sim \mathrm{DP}(\gamma, G_0), \quad j = 1, 2, \ldots$$

The HDP is particularly useful in admixture models, where each data point 
is represented using a mixture model, and components of the mixture model are 
shared between data points. It has been used in a number of applications including
text modeling [29], infinite hidden Markov models [29, 31], and modeling
genetic variation between and within populations [32] . Compared with most of 
the covariate-dependent nonparametric processes described in this survey, infer- 
ence in the HDP is relatively painless; a number of Gibbs samplers are proposed 
in [25], and several variational approaches have also been used [33, 34, 35]. Additionally,
sequential Monte Carlo methods [SS] have been developed for HDP
topic models [37] . 
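To make the sharing of atoms in the HDP concrete, the sketch below draws the weights of a truncated approximation. The global measure $G_0$ is truncated to a fixed number of atoms via stick-breaking, and each group-level measure $G_j$ is then a Dirichlet draw over those same atoms. The truncation level, the function names and the tiny floor on the Dirichlet parameters are assumptions made for this illustration; this is not one of the inference schemes cited above.

```python
import random


def truncated_sticks(concentration, truncation, rng):
    """Stick-breaking weights, with the leftover mass on the last atom."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, concentration)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)
    return weights


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    # The floor guards against a zero parameter, which gammavariate rejects.
    draws = [rng.gammavariate(max(a, 1e-12), 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_truncated_hdp(num_groups, gamma0, gamma, truncation=20, seed=0):
    """Return (global_weights, group_weights) over a shared set of atoms."""
    rng = random.Random(seed)
    global_weights = truncated_sticks(gamma0, truncation, rng)
    # A DP whose base measure has finitely many atoms reduces to a Dirichlet.
    group_weights = [
        sample_dirichlet([gamma * b for b in global_weights], rng)
        for _ in range(num_groups)
    ]
    return global_weights, group_weights


if __name__ == "__main__":
    g0, groups = sample_truncated_hdp(num_groups=3, gamma0=1.0, gamma=5.0)
    print([round(w, 3) for w in g0[:5]])
    print([round(w, 3) for w in groups[0][:5]])
```

Every group places its mass on the same atoms, which is the property that makes the HDP suitable for admixture models.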

Other hierarchical nonparametric models can be constructed in a similar
manner. The hierarchical beta process [25] replaces the hierarchy of Dirichlet
processes with a hierarchy of beta processes, allowing the creation of correlated
latent variable models.

As we will see throughout this paper, hierarchical models such as these
are often used in conjunction with an explicitly covariate-dependent random
measure, for example by replacing the DP-distributed base measure $G_0$ with a
DDP-distributed collection of base measures.

5 Dependence in atom location 

Consider a distribution over collections of atomic measures

$$G^{(x)} = \sum_{k=1}^{\infty} \pi_k^{(x)} \delta_{\theta_k^{(x)}}. \qquad (5)$$

One of the simplest ways of inducing dependency is to assume a shared set of
atom sizes $\pi_k^{(x)} = \pi_k$, $k = 1, 2, \ldots$, $x \in X$, and to allow the corresponding atom
locations $\theta_k^{(x)}$ to vary according to some stochastic process.

This is equivalent to defining a Dirichlet process on the space of stochastic 
processes, and variations on this idea have been used in a number of models. 
This construction was first made explicit in defining the single-p DDP [35]. The
spatial DP [39] replaces the stochastic processes in the single-p DDP with random
fields, to create a mixture of surfaces. The ANOVA DDP [40] creates a
mixture of analysis of variance (ANOVA) models to capture correlation between
measures associated with categorical covariates. The Linear DDP [?T] extends the
ANOVA DDP to incorporate a linear term into each of the mixture components,
making the model applicable to continuous data. Since these models can be
interpreted as Dirichlet process mixture models, inference is generally relatively
straightforward.
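A minimal sketch of this idea, under strong simplifying assumptions, is given below. The atom weights are shared across all covariate values and come from a truncated stick-breaking construction, while each atom location follows its own random line $\theta_k(x) = a_k + b_k x$ with Gaussian intercept and slope. The linear form, the truncation and all function names are illustrative choices, not a description of any specific model cited above.

```python
import random


def sample_single_p_ddp(num_atoms, gamma, seed=0):
    """Draw shared weights and one random trajectory per atom."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for _ in range(num_atoms):
        v = rng.betavariate(1.0, gamma)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    # One (intercept, slope) pair per atom: theta_k(x) = intercept + slope * x.
    paths = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(num_atoms)]
    return weights, paths


def measure_at(x, weights, paths):
    """Return the (weight, location) pairs of the measure at covariate x."""
    return [(w, a + b * x) for w, (a, b) in zip(weights, paths)]


if __name__ == "__main__":
    w, p = sample_single_p_ddp(num_atoms=5, gamma=1.0)
    print(measure_at(0.0, w, p))
    print(measure_at(1.0, w, p))
```

Only the locations change between covariate values; the weights are identical, which is why this style of model can be treated as a Dirichlet process mixture over trajectories.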

Since, for most nonparametric random measures found in the literature (the 
inhomogeneous beta process [H] being the main exception), the locations of the 
random atoms are independent of their size, this approach can be used with any 
random measure to create a dependent random measure. In addition, it can be 
combined with other mechanisms that induce dependency in the atom sizes - 
for example the Markov-DDP of [9] and the generalized Polya urn model of [10]
both employ this form of dependency, in addition to mechanisms that allow the 
size of the atoms to vary across covariate space. 

6 Stick-breaking constructions 

A large class of nonparametric priors over probability measures of the form
$P := \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}$ can be constructed using an iterative stick-breaking construction
[13J [33], wherein a size-biased ordering of the atom masses is obtained by
repeatedly breaking off random fractions of a unit length stick, so that

$$\pi_k = V_k \prod_{j=1}^{k-1} (1 - V_j), \qquad V_k \sim \mathrm{Beta}(a_k, b_k). \qquad (6)$$

Many commonly used random probability measures can be represented in this
manner: if we let $a_k = 1$, $b_k = \gamma$ we recover the Dirichlet process; if we let
$a_k = 1 - a$, $b_k = b + ka$ we recover the Pitman-Yor, or two-parameter Poisson-Dirichlet,
process. A similar procedure has been developed to represent the form
of the beta process most commonly used in the IBP [IS]. A number of authors
have created dependent nonparametric priors by starting from the stick-breaking
construction of Equation 6.
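The stick-breaking recursion is short enough to write out in full. The sketch below takes an arbitrary list of beta parameters $(a_k, b_k)$ and returns the first few weights together with the mass left on the unbroken part of the stick. The two helper functions encode the Dirichlet process and Pitman-Yor parameter choices given above; all function names are illustrative.

```python
import random


def stick_breaking_weights(beta_params, seed=0):
    """Break a unit stick using V_k ~ Beta(a_k, b_k).

    Returns (weights, remaining), where remaining is the unassigned mass.
    """
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for a_k, b_k in beta_params:
        v_k = rng.betavariate(a_k, b_k)
        weights.append(v_k * remaining)  # pi_k = V_k * prod_{j<k} (1 - V_j)
        remaining *= 1.0 - v_k
    return weights, remaining


def dirichlet_process_params(gamma, num_sticks):
    """a_k = 1, b_k = gamma for every stick."""
    return [(1.0, gamma) for _ in range(num_sticks)]


def pitman_yor_params(a, b, num_sticks):
    """a_k = 1 - a, b_k = b + k * a for k = 1, 2, ..."""
    return [(1.0 - a, b + k * a) for k in range(1, num_sticks + 1)]


if __name__ == "__main__":
    dp, rest = stick_breaking_weights(dirichlet_process_params(2.0, 10))
    print([round(w, 3) for w in dp], round(rest, 3))
    py, rest = stick_breaking_weights(pitman_yor_params(0.5, 1.0, 10))
    print([round(w, 3) for w in py], round(rest, 3))
```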

6.1 Varying the beta random variables across covariate 
space 

The multiple-p DDP [1] replaces the beta-distributed random variable $V_k$ in
Equation 6 with a stochastic process $V_k(x)$, whose $x$-marginals are distributed
according to $\mathrm{Beta}(1, \alpha)$. For example, $V_k$ might be obtained by point-wise transformation
of some stochastic process whose marginal distribution function is
known and continuous, such as a Gaussian process. The resulting marginal distributions
over random probability measures $G^{(x)} := \sum_{k=1}^{\infty} \pi_k(x) \delta_{\theta_k}$ are Dirichlet
processes by construction. While elegant, inference in the multiple-p DDP
is computationally daunting, which goes some way to explain why it has not
been used in real-world applications.

A related model is the kernel stick-breaking process [KSBP]. The KSBP
defines a covariate-dependent mixture $G^{(x)}$ of a countably infinite sequence of
probability measures $G_k^*$ as

$$G^{(x)} = \sum_{k=1}^{\infty} V_k K(x, \mu_k) \prod_{j=1}^{k-1} \left(1 - V_j K(x, \mu_j)\right) G_k^*$$

where $K : X \times X \to [0, 1]$ is a bounded kernel function, $V_k \overset{ind}{\sim} \mathrm{Beta}(a_k, b_k)$ and $\{\mu_k \in X\}$
is a set of random covariate locations for the sticks. If $K \equiv 1$ then we recover
the class of stick-breaking priors; if $K$ varies across $X$ then the weights in the
corresponding marginal stick-breaking processes vary accordingly.
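The effect of the kernel on the weights can be seen in a few lines of code. The sketch below computes the KSBP mixture weights at a covariate value `x` for a given set of sticks and stick locations. The squared-exponential kernel and its `bandwidth` parameter are assumptions made for this example; the construction only requires a kernel bounded in [0, 1].

```python
import math


def gaussian_kernel(x, mu, bandwidth):
    """A kernel taking values in (0, 1], equal to 1 when x == mu."""
    return math.exp(-((x - mu) ** 2) / (2.0 * bandwidth ** 2))


def ksbp_weights(x, sticks, locations, bandwidth=1.0):
    """Weights V_k K(x, mu_k) * prod_{j<k} (1 - V_j K(x, mu_j)) at x."""
    weights, remaining = [], 1.0
    for v_k, mu_k in zip(sticks, locations):
        u_k = v_k * gaussian_kernel(x, mu_k, bandwidth)
        weights.append(u_k * remaining)
        remaining *= 1.0 - u_k
    return weights


if __name__ == "__main__":
    sticks = [0.5, 0.5, 0.5]
    locations = [0.0, 1.0, 2.0]
    for x in (0.0, 1.0, 2.0):
        print(x, [round(w, 3) for w in ksbp_weights(x, sticks, locations)])
```

Sticks located far from `x` contribute almost nothing there, so mixture components are effectively switched on and off as `x` moves through the covariate space.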

While the multiple-p DDP varies the atom weights in such a manner as to 
maintain Dirichlet process marginals, the KSBP, in general, does not. Instead, 
it modulates the beta-distributed weights using an arbitrary kernel, resulting in 
marginally non-beta weights. This model is much easier to perform inference in 
than the multiple-p DDP described above; MCMC schemes have been proposed 
based on a Polya-urn representation or using slice sampling. A hierarchical 
variant of the KSBP has also been proposed, along with a variational inference 
scheme [37].

The matrix stick-breaking process [48] is appropriate for matrix or array-structured
data. Each row $m$ of the matrix is associated with a sequence
$U_{mk} \sim \mathrm{Beta}(1, \gamma_U)$, $k = 1, 2, \ldots$, and each column $j$ is associated with a corresponding
sequence $W_{jk} \sim \mathrm{Beta}(1, \gamma_W)$. At location $x_{mj}$, corresponding to the
$j$th element in the $m$th row of the matrix, a countable sequence of atom weights
is constructed as

$$\pi_{mjk} = U_{mk} W_{jk} \prod_{i=1}^{k-1} (1 - U_{mi} W_{ji}). \qquad (7)$$

This model is invariant under permutations of the rows and columns, and does 
not depend on any underlying metric. 

6.2 Changing the order of the beta random variables 

The multiple-p DDP, KSBP, and matrix stick-breaking process all involve changing
the weights of the beta random variables in a stick-breaking prior. A different
approach is followed by Griffin and Steel in constructing the ordered Dirichlet
process, or π-DDP [49].

In the π-DDP, we have a shared set $\{V_k\}_{k=1}^{\infty}$ of $\mathrm{Beta}(1, \alpha)$ random variables
and a corresponding set $\{\theta_k\}_{k=1}^{\infty}$ of random atom locations. At each covariate
value $x \in X$, we define a permutation $\sigma^{(x)}$, and let $G^{(x)} = \sum_{k=1}^{\infty} \pi_k^{(x)} \delta_{\theta_{\sigma^{(x)}(k)}}$,
where $\pi_k^{(x)} = V_{\sigma^{(x)}(k)} \prod_{j=1}^{k-1} (1 - V_{\sigma^{(x)}(j)})$. One method of defining such a permutation
is to associate each of the $(V_k, \theta_k)$ pairs with a location $\mu_k \in X$, and
take the permutation implied by the ordered distances $|\mu_k - x|$.
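The distance-based permutation can be sketched as follows for a finite set of sticks on a one-dimensional covariate space. The truncation to finitely many sticks, the scalar covariate and the function name are assumptions for illustration. The returned list is indexed by the original atom index, so the weight attached to a given atom can be compared across covariate values.

```python
def ordered_dp_weights(x, sticks, locations):
    """Stick-breaking weights at x, with sticks ordered by |mu_k - x|.

    weights[k] is the mass given to atom k (the atom paired with sticks[k]).
    """
    order = sorted(range(len(sticks)), key=lambda k: abs(locations[k] - x))
    weights = [0.0] * len(sticks)
    remaining = 1.0
    for k in order:
        # The nearest stick is broken first, so nearby atoms tend to be larger.
        weights[k] = sticks[k] * remaining
        remaining *= 1.0 - sticks[k]
    return weights


if __name__ == "__main__":
    sticks = [0.5, 0.2, 0.4]
    locations = [0.0, 1.0, 2.0]
    for x in (0.0, 0.9, 2.1):
        print(x, [round(w, 3) for w in ordered_dp_weights(x, sticks, locations)])
```

Since the same sticks are used at every covariate value, the total assigned mass in this truncated sketch does not depend on `x`; only its allocation across atoms changes.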

A related approach is the local Dirichlet process [SU]. Again, we have shared
sets of beta random variables $V_k$, locations in parameter space $\theta_k$ and locations
in covariate space $\mu_k$. In the local Dirichlet process, for each $x \in X$ we combine
the beta random variables, and associated parameter locations, for atoms whose
covariate locations are within a neighborhood of $x$, i.e. $|\mu_k - x| < \phi$:

$$G^{(x)} = \sum_{k} p_k(x) \delta_{\theta_{\pi_k(x)}}, \qquad (8)$$

where $C_x = \{\mu_k : |\mu_k - x| < \phi\}$, $\pi_k(x)$ is the $k$th ordered index in $C_x$, and

$$p_k(x) = V_{\pi_k(x)} \prod_{j<k} (1 - V_{\pi_j(x)}). \qquad (9)$$

While methods that manipulate the stick-breaking construction are often ele- 
gant, they can be limiting in the form of dependency available. In the multiple-p 
DDP and local DP, the size-biased nature of the stick-breaking process will mean 
that the general ordering of the atom sizes (at least for the contributing atoms)
will tend to be similar. In the π-DDP where the permutation is defined
by the relative distances, atoms are constrained to increase monotonically with
distance to a maximum size and then decrease. In addition, changes in covariate
location will tend to affect only the larger atoms.

7 Dependence in conditional distributions 

According to de Finetti's theorem, any exchangeable sequence is i.i.d. given 
some (latent) conditional distribution, and can be described using a mixture of
such distributions. For example, in the CRP, the conditional distributions are
the class of discrete probability distributions, and the mixing distribution is a 
Dirichlet process. In the previous sections, we have discussed ways of inducing 
dependency in the mixing distribution, and assumed that observations are i.i.d. 
at each covariate value. 

An alternative method of inducing dependency between observations is to 
assume the mixing distribution is common to all covariate values, but the con- 
ditional distributions are correlated. For example, in the Chinese restaurant 
process, this would mean the mixing measure for all covariate values is shared 
and distributed according to a single Dirichlet process, and the conditional dis- 
tributions are correlated but are marginally distributed according to that mixing 
measure. 

The generalized spatial Dirichlet process [51] [52] is an extension of the spatial 
Dirichlet process [39] that replaces the conditional distribution (in this case, a 
multinomial) with a collection of correlated conditional distributions. Recall 
that the spatial Dirichlet process defines a Dirichlet mixture of Gaussian random 
fields on some space X . A sample Y from such a process is a realization of the 
field associated with a single mixture component. 

The generalized spatial Dirichlet process aims to allow a sample to contain 
aspects of multiple mixture components. A naive way of achieving this would 
be to sample a different field at each location $x \in X$, but this would not result
in a continuous field over $X$. Instead, the generalized spatial Dirichlet process
ensures that the value of the field at a given location is marginally distributed
according to the Dirichlet process mixture at that location, but that as $x \to x_0$,
$p(Y(x) = \theta_i(x), Y(x_0) = \theta_j(x_0))$ tends to 0 if $i \neq j$, or to $p(Y(x_0) = \theta_i(x_0))$
otherwise.

This behavior can be achieved as follows. Let $\{Z_i(x), x \in X, i = 1, 2, \ldots\}$
be a countable collection of independent Gaussian random fields on $X$ with unit
variance and mean functions $m_i(x)$ such that $\Phi(m_i(x)) \sim \mathrm{Beta}(1, \gamma)$. Then at
each $x \in X$, $Y(x) = \theta_{k(x)}(x)$, where $k(x)$ satisfies $Z_{k(x)}(x) = \max_i Z_i(x)$.

The idea of using latent surfaces to select mixture components and hence
enforce locally similar clustering structure is explored further in the latent
stick-breaking process [53]. In the generalized spatial Dirichlet process, latent surfaces
were combined with a Dirichlet process distribution over surfaces. In the latent
stick-breaking process, latent surfaces are used to select locally smooth allocations
of parameters marginally distributed according to an arbitrary stick-breaking
process, and the authors consider multivariate extensions. A similar method is
used in the dependent Pitman-Yor process of [S3] to segment images. Here, the
authors work in a truncated model and use variational inference techniques.

A related method is employed in the dependent IBP [13]. Recall that, in
the beta-Bernoulli representation of the IBP, we have a random measure
$G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}$, and each element $z_{nk}$ of a binary matrix $Z$ is sampled as
$z_{nk} \sim \mathrm{Bernoulli}(\pi_k)$. The dependent IBP couples matrices $Z(x) = (z_{nk}(x))$, $x \in X$,
by jointly sampling the elements $z_{nk}(x)$ according to a stochastic process with
$\mathrm{Bernoulli}(\pi_k)$ marginals. In practice, this is achieved by thresholding a zero-mean
Gaussian process with a threshold value of $\Phi^{-1}(\pi_k)$.
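The thresholding step can be illustrated for a single matrix entry observed on a regular grid of covariate values. In the sketch below the Gaussian process is a stationary first-order autoregressive sequence with zero mean and unit marginal variance, which is an assumption made only to keep the example free of matrix computations; `rho` is a hypothetical correlation parameter between neighbouring grid points. The entry is set to one wherever the process falls below $\Phi^{-1}(\pi_k)$, so each value is marginally Bernoulli($\pi_k$).

```python
import math
import random
from statistics import NormalDist


def sample_dependent_feature(pi_k, num_steps, rho=0.9, seed=0):
    """Sample z_nk(x) on a grid by thresholding a unit-variance AR(1) process.

    pi_k must lie strictly between 0 and 1.
    """
    rng = random.Random(seed)
    threshold = NormalDist().inv_cdf(pi_k)  # Phi^{-1}(pi_k)
    g = rng.gauss(0.0, 1.0)  # stationary start: N(0, 1)
    column = []
    for _ in range(num_steps):
        column.append(1 if g < threshold else 0)
        # The AR(1) update keeps the marginal distribution at N(0, 1).
        g = rho * g + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    return column


if __name__ == "__main__":
    z = sample_dependent_feature(pi_k=0.3, num_steps=40)
    print("".join(str(v) for v in z))
```

Larger values of `rho` give longer runs of zeros and ones, that is, features that switch on and off more slowly across the covariate space.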

8 Dependence in predictive distributions 

Recall that when conjugacy exists between the de Finetti mixing measure and 
the sampling distribution, the predictive distribution $p(X_{n+1} \mid X_1, \ldots, X_n)$ can in
most cases be obtained analytically. In this section we describe two approaches 
to induce dependence using the predictive distribution. 

The first approach induces dependence in a partially exchangeable^2 sequence
of observations that arrive over time in batches by creating Markov chains of 
CRPs that incorporate a subset of the observations from the previous time into 
the current CRP. Such approaches maintain CRP-distributed marginals, but 
are difficult to extend to covariates other than time. 

An alternative approach is to modify the predictive distribution to explicitly 
depend on a covariate (or some function thereof) . This form of construction can 
be applicable to sequential or arbitrary covariates. Unlike many of the models 
described elsewhere in this survey, these models are not based on a single shared 
random measure. These models can be easily adapted to arbitrary covariate 
spaces, but in general lack the property of marginal invariance. 

8.1 Markov chains of partitions 

The generalized Polya urn [GPU, 10] constructs a DDP over time by leveraging
the invariance of the combinatorial structure of the CRP with respect to subsampling.
Specifically, at time $t = 1$ draw a set of atom assignments for $n_1$ customers,
$Z_1 = \{z_{1,1}, z_{2,1}, \ldots, z_{n_1,1}\}$, and associated atoms $\theta_1 = \{\theta_{1,1}, \ldots, \theta_{K_1,1}\}$
according to a CRP with base measure $G_0$ on $\Theta$, where $K_1$ is the number of
tables with a customer at time 1. For $t \geq 2$, some subset of the existing customers
leave the restaurant, according to one of two deletion schemes (or a
combination):

1. Size-biased deletion: An entire table is deleted with probability proportional
to the number of people seated at that table.

2. Uniform deletion: Each customer in the restaurant decides indepen- 
dently to stay with some probability q, otherwise they leave. 
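The two deletion schemes can be sketched as operations on a list of table occupancy counts. The function names and the `seed` arguments below are illustrative; only the deletion step is shown, not the subsequent atom transitions or the seating of new customers.

```python
import random


def uniform_deletion(table_counts, stay_prob, seed=0):
    """Each customer independently stays with probability stay_prob."""
    rng = random.Random(seed)
    survivors = []
    for count in table_counts:
        stayed = sum(1 for _ in range(count) if rng.random() < stay_prob)
        survivors.append(stayed)
    return survivors


def size_biased_deletion(table_counts, seed=0):
    """Empty one table, chosen with probability proportional to its size."""
    if sum(table_counts) == 0:
        return list(table_counts)
    rng = random.Random(seed)
    doomed = rng.choices(range(len(table_counts)), weights=table_counts)[0]
    return [0 if k == doomed else c for k, c in enumerate(table_counts)]


if __name__ == "__main__":
    counts = [5, 3, 2]
    print(uniform_deletion(counts, stay_prob=0.5))
    print(size_biased_deletion(counts))
```

Tables left with zero customers after either step would be dropped before the next batch of customers is seated.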

All remaining atoms that existed at time $t - 1$, $\theta_{k,t-1}$, are updated by a transition
kernel $T(\theta_{k,t} \mid \theta_{k,t-1})$ such that $\int T(\theta_{k,t} \mid \theta_{k,t-1}) G_0(d\theta_{k,t-1}) = G_0(d\theta_{k,t})$. This is
similar to the single-p DDP and related models described in Section 5.

2 A set of sequences $\{X_i = (x_1, x_2, \dots, x_{n_i})\}$, each with $n_i$
observations, is partially exchangeable if each sequence is exchangeable, but
observations in two different sequences are not exchangeable. This is the
assumption made by the two-level hierarchical Dirichlet process and hierarchical
beta process, and is appropriate, for example, when modeling text documents,
where words within a document are exchangeable but words in different documents
are not.

For each of the $n_t$ customers at time $t$, sample the seating assignment for
the $i$th new customer according to a CRP that depends on the customers from the
previous time step that remained in the restaurant and the $i-1$ new customers
that have already entered the restaurant. Dependence is induced through the
choice of subsampling probability $q$ and the transition kernel $T(\cdot,\cdot)$.
The authors provide sequential Monte Carlo and Gibbs inference algorithms.
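The uniform-deletion variant of the generalized Pólya urn described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names `crp_seat` and `gpu_step` are hypothetical, the concentration parameter `alpha` is assumed known, and the transition of the atoms under $T$ is omitted so that only the table sizes are tracked.

```python
import random


def crp_seat(counts, alpha, rng):
    """Pick a table for one new customer under CRP weights.

    counts holds the current table sizes; returning len(counts) means
    that the customer opens a new table.
    """
    weights = counts + [alpha]
    r = rng.random() * sum(weights)
    acc = 0.0
    for k, w in enumerate(weights):
        acc += w
        if r < acc:
            return k
    return len(counts)  # guard against floating-point round-off


def gpu_step(counts, n_new, alpha, q, rng):
    """One time step: uniform deletion followed by CRP seating."""
    # Uniform deletion: each customer stays independently with probability q.
    survivors = [sum(1 for _ in range(c) if rng.random() < q) for c in counts]
    # Tables that lost every customer disappear.
    tables = [c for c in survivors if c > 0]
    # New customers are seated one at a time, conditioning on the survivors
    # and on the new customers who have already been seated.
    for _ in range(n_new):
        k = crp_seat(tables, alpha, rng)
        if k == len(tables):
            tables.append(1)
        else:
            tables[k] += 1
    return tables
```

Calling `gpu_step([5, 3, 2], 4, 1.0, 0.5, random.Random(0))` returns the table sizes at the next time step; setting `q` close to 1 makes consecutive partitions strongly dependent, while `q = 0` gives independent CRPs.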

The recurrent Chinese restaurant process [RCRP, 55] is similar to, and
sometimes a special case of, the GPU. In the RCRP all customers leave the
restaurant at the end of a time step; however, the atoms and the number of
customers assigned to each are remembered for the next time step. The first
customer to sit at a table from the previous time step is allowed to transition
the associated atom. The RCRP does not restrict the transition to be invariant
with respect to the base measure $G_0$, which means that the marginal measures
of the RCRP may not all be DPs with the same base measure (though they will all
be DPs). The RCRP can be extended, as discussed in [55], to model higher-order
correlations by modulating the counts from previous times by a decay function
[56], e.g. $e^{-h/\lambda}$, where $h$ is a lag and $\lambda$ determines the
length of influence of the counts at each time.
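As a rough illustration of the higher-order extension, the sketch below accumulates table counts from earlier time steps, assuming an exponential decay of the form $\exp(-h/\lambda)$ for lag $h$. The function name `decayed_counts` and the dictionary-based data layout are assumptions made for this example only.

```python
import math


def decayed_counts(history, lam):
    """Combine past table counts into prior pseudo-counts.

    history[h - 1] maps a table id to the number of customers seated at
    that table h time steps ago; lam is the decay length scale.
    """
    prior = {}
    for h, counts in enumerate(history, start=1):
        weight = math.exp(-h / lam)  # assumed decay function
        for table, n in counts.items():
            prior[table] = prior.get(table, 0.0) + weight * n
    return prior
```

The resulting pseudo-counts would play the role of the remembered counts when seating customers at the current time step; a larger `lam` lets older time steps retain more influence.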

8.2 Explicit distance dependency in the predictive distribution

An alternative interpretation of the CRP is to say that customers choose who to
sit next to uniformly, rather than to say customers pick a table with probability
proportional to the table size. An assignment, $c = \{c_1, \dots, c_n\}$, of
customers to other customers is equivalent to the usual table-based
interpretation of the CRP by defining two customers $i$ and $j$ to be at the
same table, i.e. $z_i = z_j$, if there is a sequence of customer assignments
that starts with $i$ or $j$ and ends with the other, e.g.
$(c_i, c_{c_i}, \dots, j)$. This map is not one-to-one, in that the same table
assignment can be generated by two different customer assignments.
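The map from customer assignments to table assignments amounts to finding the connected components of the link graph. A minimal sketch using a union-find structure is given below; the function name `tables_from_links` is hypothetical.

```python
def tables_from_links(c):
    """Map customer-to-customer links to table labels.

    c[i] is the customer that customer i chose to sit with (c[i] == i is
    allowed). Two customers share a table when they are connected by a
    chain of links, regardless of the direction of the links.
    """
    parent = list(range(len(c)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in enumerate(c):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj

    # Relabel the component roots as consecutive table ids.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(c))]
```

For example, the link vectors `[0, 0, 1, 3, 3]` and `[1, 0, 1, 3, 3]` both yield the table assignment `[0, 0, 0, 1, 1]`, which illustrates why the map is not one-to-one.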

The distance-dependent CRP [ddCRP, 57] utilizes this representation. Let
$d_{ij}$ denote a dissimilarity³ measure between customers $i$ and $j$ and let
$f(\cdot)$ be a monotonically decreasing function of the $d_{ij}$ called a decay
function. Using the customer assignment interpretation of the CRP, the ddCRP is
defined as follows:

$$p(c_i = j \mid D, \alpha) \propto \begin{cases} f(d_{ij}) & \text{if } i \neq j, \\ \alpha & \text{if } i = j, \end{cases}$$

where $D$ is the matrix of dissimilarities $d_{ij}$ and $\alpha$ is a
concentration parameter. Loosely speaking, a customer is more likely to choose
to sit next to a person he lives near.

Different forms for the decay function, $f(\cdot)$, allow the type of dependence
to be controlled. For example, a "window" decay function describes an explicit
limit on the maximum distance between customers that can be assigned to each
other. Soft decay functions can also be used to allow the influence of customers
to diminish over time. The ddCRP has been applied to language modeling, a
dynamic mixture model for text, and clustering network data [57].
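A draw of customer links from the ddCRP prior can be sketched as follows, assuming the usual rule in which customer $i$ links to $j \neq i$ with probability proportional to $f(d_{ij})$ and to themselves with probability proportional to the concentration parameter. The function names and the particular window decay are illustrative assumptions.

```python
import random


def window_decay(width):
    """Window decay: full weight inside the window, none outside."""
    return lambda d: 1.0 if d < width else 0.0


def sample_ddcrp_links(dist, alpha, decay, rng):
    """Sample one link per customer.

    dist[i][j] is the dissimilarity between customers i and j, alpha > 0
    is the self-link weight and decay maps a dissimilarity to a weight.
    """
    n = len(dist)
    links = []
    for i in range(n):
        weights = [alpha if j == i else decay(dist[i][j]) for j in range(n)]
        links.append(rng.choices(range(n), weights=weights)[0])
    return links
```

With customers located at positions 0, 1 and 10 and a window of width 2, the third customer can only link to themselves, so they always start their own table, while the first two may or may not be seated together.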

3 $d_{ij}$ need not satisfy the triangle inequality.

This idea of modifying the predictive distribution to depend on covariates can
also be applied to the IBP predictive distribution to create covariate-dependent
latent feature models. Assume there are $N$ customers and let
$s_{ij} = f(d_{ij})$ denote a similarity between customers $i$ and $j$ and
$w_{ij} = s_{ij} / \sum_{l=1}^{N} s_{il}$.

The distance-dependent IBP [ddIBP, 55] first draws a Poisson number of dishes
for each customer; the dishes drawn for a given customer are said to be "owned"
by that customer. For a given dish $k$, analogously to the ddCRP, each customer
$i$ chooses to attach themselves to another customer $j$ (possibly themselves)
with probability $w_{ij}$. After this process, a binary matrix $Z$ is created
where entry $z_{ik} = 1$ if customer $i$ can reach the owner of dish $k$ by
following the customer assignments. Unlike the ddCRP, the direction of the
assignments matters: there must exist a sequence of customer assignments
starting at $i$ and ending at the dish owner. If no such path exists then
$z_{ik} = 0$. The ddIBP has been used for covariate-dependent dimension
reduction in a classification task with a latent feature model.
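The directed reachability rule that defines one column of $Z$ can be sketched as below. The function name `dish_membership` is hypothetical, and the links for the dish are assumed to have been sampled already.

```python
def dish_membership(links, owner):
    """Return one column of Z for a single dish.

    links[i] is the customer that customer i attached to for this dish and
    owner is the index of the customer who owns the dish. Entry i is 1 if
    following the links from i eventually reaches the owner, else 0.
    """
    column = []
    for i in range(len(links)):
        seen = set()
        cur = i
        # Follow the chain until the owner is found or a cycle is detected.
        while cur != owner and cur not in seen:
            seen.add(cur)
            cur = links[cur]
        column.append(1 if cur == owner else 0)
    return column
```

With `links = [1, 1, 1, 2]` and `owner = 0` only the owner has the dish, even though every customer is connected to customer 1 in the undirected sense; this is the sense in which direction matters.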

9 Dependent Poisson random measures 

Since CRMs and Poisson processes are deeply connected (see Section 2), it is
natural to construct dependent random measures by manipulating the underlying
Poisson process. Recall that a Poisson process on $\Theta \times \mathbb{R}_+$
with rate given by a positive Lévy measure $\nu(d\theta, d\pi)$ defines a CRM on
$\Theta$. Therefore, any operation on a Poisson process that yields a new
Poisson process will also yield a new CRM. If we ensure that the operation
yields a Poisson process with the same rate and allow the operation to depend on
some covariate, then we define a dependent CRM that varies with that covariate.
From here, we can define a dependent NRM via normalization.

9.1 Operations on Poisson processes 

In the following, let $\Pi = \{(x_i, \theta_i, \pi_i)\}$ be a Poisson process on
$\mathcal{X} \times \Theta \times \mathbb{R}_+$ with the product σ-algebra
$\mathcal{F}_{\mathcal{X}} \otimes \mathcal{F}_{\Theta} \otimes \mathcal{F}_{\mathbb{R}_+}$
and rate measure $\nu(dx, d\theta, d\pi)$. We interpret $\mathcal{X}$ as a space
of covariates (with a metric), $\Theta$ as the parameter space and
$\mathbb{R}_+$ as the space of atom masses. Below, we describe four properties
of Poisson processes that have been leveraged to construct dependent CRMs/NRMs:
the superposition theorem, the transition theorem, the mapping theorem, and
random subsampling. See [60, 61, 62] for the general statements of these
properties.

The superposition theorem Let $\Upsilon$ be another Poisson process,
independent of $\Pi$, on the same space as $\Pi$ with rate measure
$\phi(dx, d\theta, d\pi)$. Then the superposition, $\Pi \cup \Upsilon$, is again
a Poisson process with updated rate measure $\nu + \phi$. The superposition
theorem follows from the additive property of Poisson random variables [50].
Additionally, if $\{\Pi_n\}$ is a countable set of independent Poisson processes
on the same space with respective rate measures $\{\nu_n\}$, then the
superposition, $\cup_n \Pi_n$, is a Poisson process with rate measure
$\sum_n \nu_n$.

4 Since $f$ is a monotonically decreasing function of dissimilarity it can be
interpreted as a notion of similarity.

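The transition theorem A transition kernel is a function
$T : \mathcal{X} \times \Theta \times \mathcal{F}_{\mathcal{X} \times \Theta} \to [0, 1]$
such that for $(x, \theta) \in \mathcal{X} \times \Theta$, $T(x, \theta, \cdot)$
is a probability measure on $\mathcal{X} \times \Theta$ and, for measurable $A$,
$T(\cdot, \cdot, A)$ is measurable. We denote a sample from $T$ as
$T(x, \theta)$, by which we mean the pair $(x, \theta)$ is moved to the point
$T(x, \theta)$ according to $T$. The set
$T(\Pi) = \{(T(x, \theta), \pi) : (x, \theta, \pi) \in \Pi\}$ is then a Poisson
process with rate
$\int_{\mathcal{X} \times \Theta} T(x, \theta, \cdot)\, \nu(dx, d\theta, d\pi)$.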

The mapping theorem If $\Pi$ is a Poisson process on some space $S$ with rate
measure $\nu(\cdot)$, and $f : S \to T$ is a measurable mapping to some space
$T$, then $f(\Pi)$ is a Poisson process on $T$ with rate measure
$\nu(f^{-1}(\cdot))$. For example, if
$S := \mathcal{X} \times \Theta \times \mathbb{R}_+$ and $f_A$ retains the atoms
with $x \in A$ and projects them onto $\Theta \times \mathbb{R}_+$, then
$f_A(\Pi)$ is a Poisson process on $\Theta \times \mathbb{R}_+$ with rate
measure $\int_{x \in A} \nu(dx, d\theta, d\pi)$. The mapping theorem can be
obtained as a special case of the transition theorem.

Random subsampling Let $q : \mathcal{X} \times \Theta \to [0, 1]$ be a
measurable function. Associate with each atom a Bernoulli random variable $z_i$
such that $p(z_i = 1) = q(x_i, \theta_i)$. Then the sets
$\Pi_b = \{(x_i, \theta_i, \pi_i) : z_i = b\}$, for $b \in \{0, 1\}$, are
independent Poisson processes such that $\Pi_b \sim \mathrm{PP}(\nu_b)$, where
$\nu_0(dx, d\theta, d\pi) = (1 - q(x, \theta))\, \nu(dx, d\theta, d\pi)$ and
$\nu_1(dx, d\theta, d\pi) = q(x, \theta)\, \nu(dx, d\theta, d\pi)$ are the
respective rate measures. The CRM associated with $\Pi_1$ can then be written as

$$G = \sum_{i=1}^{\infty} z_i \pi_i \delta_{\theta_i}.$$

This is a special case of the marking theorem; see (SO] and [HT] for an in-depth
treatment.
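On a finite collection of atoms, superposition and random subsampling reduce to list concatenation and independent coin flips. The sketch below is only meant to make the two operations concrete; the function names are hypothetical and the atoms are assumed to be stored as `(x, theta, pi)` tuples.

```python
import random


def superpose(points_a, points_b):
    """Superposition: the union of the atoms of two independent processes."""
    return list(points_a) + list(points_b)


def thin(points, q, rng):
    """Random subsampling of atoms (x, theta, pi).

    Each atom is retained independently with probability q(x, theta).
    Returns the retained atoms and the removed atoms as two lists.
    """
    kept, removed = [], []
    for x, theta, pi in points:
        target = kept if rng.random() < q(x, theta) else removed
        target.append((x, theta, pi))
    return kept, removed


def crm_mass(points):
    """Total mass of the measure defined by a finite set of atoms."""
    return sum(pi for _, _, pi in points)
```

Letting `q` depend on the covariate `x` is what turns thinning into a mechanism for covariate dependence: atoms far from a location of interest can be made unlikely to survive there.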

9.2 Superposition and subsampling in a Poisson representation

A family of bivariate dependent CRMs (and NRMs via normalization) is
constructed in [53] and [51] for partially exchangeable sequences of
observations. The construction is based on a representation of bivariate Poisson
processes due to Griffiths and Milne [55]. If $N_i = M_i + M_0$ for $i = 1, 2$,
where the $M_i$ are Poisson processes with rate $\nu$ that represent
idiosyncratic aspects of the $N_i$ and $M_0$ is a baseline Poisson process with
rate measure $\nu$, then by the superposition theorem the $N_i$ are Poisson
processes with rate $2\nu$.

This bivariate Poisson process can be used to define a bivariate CRM
$(G_1, G_2)$ that takes the form
$G_i = \sum_{k=1}^{\infty} \pi_{ik} \delta_{\theta_{ik}} + \sum_{k=1}^{\infty} \pi_{0k} \delta_{\theta_{0k}}$.
The bivariate CRM $(G_1, G_2)$ can then be normalized to give a bivariate NRM
$(P_1, P_2)$, which can be used as a prior for a mixture model for partially
exchangeable data. This family of dependent CRMs has additionally been used as a
prior distribution for the hazard rate for partially exchangeable survival data
[33].

Lin et al [9] use the superposition and subsampling theorems above to create a
discrete-time Markov chain $\Pi_1, \Pi_2, \dots$ of Poisson processes and, from
there, a Markov chain of Dirichlet processes that we denote the Markov-DDP. Let
$\nu(d\theta, d\pi) = \pi^{-1} \exp(-\pi)\, d\pi\, H_0(d\theta)$. At each
timestep $i$, the Poisson process $\Pi_{i-1}$ is thinned by deleting each atom
independently with probability $q$, and is superimposed with an independent
Poisson process with rate $q\nu(d\theta, d\pi)$, so that the overall rate
remains $\nu$. In addition, the atoms from time $i-1$ are transformed in
$\Theta$ using a transition kernel $T$. The Poisson process at each timestep
defines a gamma process, and indeed we can directly construct the sequence of
dependent gamma processes as

$$G_1 \sim \mathrm{CRM}(\nu), \qquad G_i = T(S_q(G_{i-1})) + \xi_i, \quad i > 1, \tag{12}$$

where $\xi_i \sim \mathrm{CRM}(q\nu)$ and the operation $S_q(G)$ denotes the
deletion of each atom of $G$ with probability $q$, corresponding to subsampling
the underlying Poisson process. The transition $T$ affects only the locations of
the atoms, and not their sizes. Lin et al use the gamma process as the CRM in
Equation 12 to generate dependent Dirichlet processes, but arbitrary CRMs can
also be used to generate a wider class of dependent NRMs, as described by Chen
et al [56].
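The recursion in Equation 12 can be mimicked on a finite set of atoms. The sketch below is an approximation under stated assumptions, not the sampler of Lin et al: the innovation gamma process is replaced by a finite-dimensional approximation with `n_atoms` atoms whose masses are Gamma(q / n_atoms, 1), the base measure is taken to be standard normal, and all function names are hypothetical.

```python
import random


def innovation(q, n_atoms, rng):
    """Finite approximation to an innovation measure with total mass q.

    Assumption: n_atoms atoms with Gamma(q / n_atoms, 1) masses and
    standard normal locations standing in for draws from the base measure.
    """
    if q <= 0:
        return []
    return [(rng.gauss(0.0, 1.0), rng.gammavariate(q / n_atoms, 1.0))
            for _ in range(n_atoms)]


def markov_ddp_step(atoms, q, transition, rng, n_atoms=50):
    """One step G_i = T(S_q(G_{i-1})) + xi_i on (location, mass) atoms."""
    # S_q: delete each atom independently with probability q.
    survivors = [(loc, mass) for loc, mass in atoms if rng.random() >= q]
    # T: move the surviving locations; the masses are left untouched.
    moved = [(transition(loc, rng), mass) for loc, mass in survivors]
    return moved + innovation(q, n_atoms, rng)


def normalize(atoms):
    """Normalize the masses to obtain a random probability measure."""
    total = sum(mass for _, mass in atoms)
    return [(loc, mass / total) for loc, mass in atoms]
```

A random-walk transition such as `lambda loc, rng: loc + rng.gauss(0.0, 0.1)` is convenient for experimentation, although it does not leave a standard normal base measure invariant; an invariant kernel would be needed to keep the marginals fixed over time.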

The Markov-DDP is a recharacterization of the size-biased deletion variant of
the GPU DDP described in Section 8, although the MCMC inference approach used is
different. As is clear from Equation 12, we need not instantiate the underlying
Poisson process; Lin et al employ an MCMC sampler based on the Chinese
restaurant process to sample the cluster allocations directly, and Chen et al
use a slice sampler to perform inference in the underlying CRMs.

There are certain drawbacks to this construction. First, only discrete, ordered
covariates are supported, making it inappropriate for applications such as
spatial modeling. Second, it is not obvious how to learn the thinning
probability $q$. In the literature $q$ is taken to be a fixed constant, and an
ad hoc method such as cross-validation is needed to find a suitable value.



9.3 Poisson processes on an augmented space 

An alternative construction of dependent CRMs and NRMs that overcomes these
drawbacks is to use a covariate-dependent kernel function to modulate the
weights of a CRM. This technique was explicitly used in constructing the kernel
beta process and, as was shown by Foti and Williamson [67], can be used to
construct a wider class of dependent models, including the spatial normalized
gamma process [68].

As before, let $\Pi = \{(\mu_k, \theta_k, \pi_k)\}$ be a Poisson process on
$\mathcal{X} \times \Theta \times \mathbb{R}_+$ with rate measure
$\nu(d\mu, d\theta, d\pi) = R_0(d\mu) H_0(d\theta) \nu_0(d\pi)$. Additionally,
let $K : \mathcal{X} \times \mathcal{X} \to [0, 1]$ be a bounded kernel
function. Then, define the set of covariate-dependent CRMs
$\{G^{(x)} : x \in \mathcal{X}\}$ as

$$G^{(x)} = \sum_{k=1}^{\infty} K(x, \mu_k)\, \pi_k\, \delta_{\theta_k}. \tag{13}$$

By the mapping theorem for Poisson processes this is a well-defined CRM. 

If we take $\nu_0$ to be the Lévy measure of the homogeneous beta process, we
obtain the kernel beta process [15]. Rather than use a single kernel function,
the KBP uses a dictionary of exponential kernels with varying widths.

We can use the kernel beta process to construct a distribution over dependent
binary matrices by using each $G^{(x)}$ to parameterize a Bernoulli process
(BeP), $Z^{(x)} \sim \mathrm{BeP}(G^{(x)})$ [25]. As noted in [15], using
$G^{(x)}$ to parameterize a Bernoulli process has an Indian buffet metaphor,
where each customer first decides if they're close enough to a dish in the
covariate space ($K(x, \mu_k)$) and, if so, tries the dish with probability
proportional to its popularity ($\pi_k$).

The kernel beta process has been used in a covariate-dependent factor model
for music segmentation and image denoising and interpolation [15]. Inference
for the KBP feature model was performed with a Gibbs sampler on a truncated
version of the measures $G^{(x)}$.

If the Lévy measure $\nu_0$ in Equation 13 is the Lévy measure of a gamma
process, and we use a box kernel $K(x, \mu) = I(\|x - \mu\| < W)$, then we
obtain a dependent gamma process, where each atom is associated with a location
in covariate space and contributes to measures within a distance $W$ of that
location. So, for $x, x' \in \mathcal{X}$, two gamma processes $G^{(x)}$ and
$G^{(x')}$ will share more atoms the closer $x$ and $x'$ are in $\mathcal{X}$,
and vice versa. Note that if an atom appears in two measures $G^{(x)}$ and
$G^{(x')}$, it will have the same mass in each.
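The kernel-weighted construction is easy to illustrate on a truncated set of atoms with a scalar covariate. The sketch below assumes atoms stored as `(mu, theta, pi)` tuples and a box kernel of half-width `width`; the function names are hypothetical.

```python
def box_kernel(width):
    """Box kernel: 1 if the covariate is within width of the atom, else 0."""
    return lambda x, mu: 1.0 if abs(x - mu) < width else 0.0


def kernel_crm(atoms, x, kernel):
    """Evaluate the kernel-weighted measure at covariate x.

    atoms is a list of (mu, theta, pi) tuples. The result lists the
    (theta, weight) pairs that receive non-zero weight at x.
    """
    measure = []
    for mu, theta, pi in atoms:
        weight = kernel(x, mu) * pi
        if weight > 0:
            measure.append((theta, weight))
    return measure
```

With atoms located at 0, 1 and 5 and `width = 1.5`, the measures at `x = 0.5` and `x = 0.9` contain the same two atoms with identical masses, while the measure at `x = 4.0` contains only the third atom.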

If we normalize this gamma process at each covariate value, we obtain the
fixed-window form of the spatial normalized gamma process (SNGP) [68]. Placing
a prior on the width $W$ allows us to recover the full SNGP model.

The SNGP can also be obtained using the mapping and subsampling theorems for
Poisson processes. Let $\mathcal{Y}$ be an auxiliary space to be made explicit,
$\mathcal{T}$ an index set which we take to be $\mathbb{R}$, and $G$ a gamma
process on $\mathcal{Y} \times \Theta$. For $t \in \mathcal{T}$ let
$Y_t \subseteq \mathcal{Y}$ be measurable. The measure
$G^{(t)} = \int_{Y_t} G(dy, d\theta) = \sum_{k=1}^{\infty} I(y_k \in Y_t)\, \pi_k\, \delta_{\theta_k}$
is a gamma process by the mapping and subsampling (using a fixed thinning
probability) theorems for Poisson processes, where the rate measure has been
updated accordingly. A set of dependent Dirichlet processes is obtained as
$D^{(t)} = G^{(t)} / G^{(t)}(\Theta)$. As a complete example, suppose that each
atom is active for a fixed window of width $2L$ centered at a time $t$. In this
case $\mathcal{Y} = \mathbb{R}$ and the sets $Y_t = [t - L, t + L]$, so two DPs
$D^{(s)}$ and $D^{(t)}$ will share atoms as long as $|s - t| < 2L$. See [68] for
further examples. The SNGP can be used with arbitrary covariates; however,
covariate spaces of dimension greater than 2 become computationally prohibitive
because of the geometry of the $Y_t$.

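A related, but different, method for creating dependent NRMs using time as the
underlying covariate was presented in [11]. Let
$\Pi = \{(\mu_i, \theta_i, \pi_i)\}$ be a Poisson process on
$\mathbb{R} \times \Theta \times \mathbb{R}_+$ with a carefully constructed rate
measure $\nu(dt, d\theta, d\pi)$. Then a family of time-varying NRMs
$\{G_t, t \in \mathbb{R}\}$ can be constructed where

$$G_t = \frac{\sum_{i} I(\mu_i \leq t)\, e^{-\lambda (t - \mu_i)}\, \pi_i\, \delta_{\theta_i}}{\sum_{i} I(\mu_i \leq t)\, e^{-\lambda (t - \mu_i)}\, \pi_i}$$

and $\lambda > 0$ controls how quickly the influence of an atom decays.
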
The members of this family of time-varying NRMs are denoted Ornstein-Uhlenbeck
NRMs (OUNRMs), since the kernel used to define $G_t$ is an Ornstein-Uhlenbeck
kernel. Though similar to the kernel-weighted NRMs (KNRMs) presented above, the
Lévy measure used to construct an OUNRM utilizes machinery for stochastic
differential equations in order to ensure that the marginal measures, $G_t$, are
of the same form as the original NRM on the larger space [69]. In particular,
the OUNRM construction makes use of the Ferguson-Klass representation [19] of a
stochastic integral with respect to a Lévy process, which dictates the form of
$\nu$ above. Inference can be performed with an MCMC algorithm that utilizes
Metropolis-Hastings steps as well as slice sampling [70]. In addition, a
particle filtering algorithm [33] has been derived to perform online inference.

Another method that makes use of the Ferguson-Klass representation is the
Poisson line process-based dependent CRM [12]. A Poisson line process is a
Poisson process on the space of infinite straight lines in $\mathbb{R}^2$ [71],
and so a sample from a Poisson line process is a collection of straight lines in
the plane, such that the number of lines passing through a convex set is
Poisson-distributed with mean proportional to the measure of that set.

A useful fact about Poisson line processes is that the intersections of a
homogeneous Poisson line process with an arbitrary line in $\mathbb{R}^2$
describe a homogeneous Poisson point process on $\mathbb{R}$. Clearly, the
intersections of a Poisson line process with two lines $\ell, \ell'$ in
$\mathbb{R}^2$ describe a dependent Poisson point process: marginally, each
describes a Poisson point process, but the two processes are correlated. The
closer the two lines, the greater the correlation between the Poisson processes.

In [12], covariate values are mapped to lines in $\mathbb{R}^2$, and a Poisson
line process on $\mathbb{R}^2$ induces a collection of dependent Poisson
processes corresponding to those values. Judicious choice of the rate measure of
the Poisson line process ensures that the marginal Poisson processes on
$\mathbb{R}$ at each covariate value have rate 1/2. Taking the absolute value of
these Poisson processes with respect to some origin gives a unit-rate Poisson
process, which can be transformed to give an arbitrary CRM (including
inhomogeneous variants) via the Ferguson-Klass representation as described in
Section 2.2.

10 Superpositions and convex combinations of 
random measures 

An alternative way of looking at the SNGP is as a convex combination of
Dirichlet processes. In the Poisson process representation of the SNGP,
described in Section 9, a Poisson process is constructed on an augmented space.
At each covariate location $x \in \mathcal{X}$, we obtain a covariate-dependent
Poisson process by restricting this large Poisson process to a subset $A_x$ of
the augmented space. For a finite number of covariate locations, the
corresponding subsets define a finite algebra. Let $\mathcal{R}$ be the smallest
collection of disjoint subsets in this algebra such that each $A_x$ is a union
of subsets in $\mathcal{R}$. Each subset $r \in \mathcal{R}$ is associated with
an independent Poisson process, and hence an independent gamma process
$G^{(r)}$. These gamma processes can be represented as unnormalized Dirichlet
processes with gamma-distributed weights. Each $A_x$ is therefore associated
with a gamma process $G^{(x)}$ obtained by the superposition of the gamma
processes $\{G^{(r)} : r \in A_x \cap \mathcal{R}\}$. If we normalize the gamma
process $G^{(x)}$ we obtain a mixture of Dirichlet processes, with
Dirichlet-distributed weights corresponding to the normalized masses of the
$G^{(r)}$. Similarly, the bivariate random measures of [53, 51] and the
partially exchangeable model of [30] can be represented as superpositions or
convex combinations of random measures.

A number of other models have been obtained via convex combinations of 
random measures associated with locations in covariate space, although these 
models do not necessarily fulfill the desiderata of having marginals distributed 
according to a standard nonparametric process. 

The dynamic Dirichlet process [73] is an autoregressive extension to the
Dirichlet process. The measure at time-step $i$ is a convex combination of the
measure at time-step $i-1$ and a DP-distributed innovation measure:

$$\begin{aligned}
G_1 &\sim \mathrm{DP}(\gamma_0, G_0) \\
H_i &\sim \mathrm{DP}(\gamma_i, H_{0i}) \\
w_i &\sim \mathrm{Beta}(a_w(i), b_w(i)) \\
G_i &= (1 - w_{i-1}) G_{i-1} + w_{i-1} H_{i-1}, \quad i > 1.
\end{aligned}$$

The measure at each time step is thus a convex combination of Dirichlet
processes. Note that, in general, this will not be marginally distributed
according to a Dirichlet process, and that the marginal distribution will depend
on time.

The dynamic hierarchical Dirichlet process [73] extends this model to grouped
data. Here, we introduce the additional structure
$H_{0i} := G_0 \sim \mathrm{DP}(\gamma, H)$. We see that each measure $G_i$ is a
convex combination of $G_1, H_1, \dots, H_{i-1}$, and that these basis measures
are samples from a single HDP. Again, unlike the SNGP, this model does not have
Dirichlet process-distributed marginals, and can only be applied along a single
discrete, ordered covariate.

A related model that is applicable to more general covariate spaces is the
Bayesian density regression model of [74]. Here, the random probability measure
$G^{(x_i)}$ at each covariate value $x_i$ is given by a convex combination of
the measures at neighbouring covariate values plus an innovation random measure:

$$G^{(x_i)} = a_{ii} G^{*}_{i} + \sum_{j \sim i} a_{ij} G^{(x_j)},$$

where $G^{*}_{i}$ is the innovation measure, the non-negative weights $a_{ij}$
sum to one over $j$, and $(j \sim i)$ indicates the set of locations within a
defined neighborhood of $x_i$. A similar approach has been applied to the beta
process to create a dependent hierarchical beta process (dHBP) [13].
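Unrolling the dynamic Dirichlet process recursion $G_i = (1 - w_{i-1}) G_{i-1} + w_{i-1} H_{i-1}$ shows explicitly how each $G_i$ is a convex combination of the basis measures $G_1, H_1, \dots, H_{i-1}$. The helper below, whose name is hypothetical, computes the mixing coefficients for a given sequence of innovation weights.

```python
def mixture_coefficients(w):
    """Coefficients of G_{n+1} on the basis (G_1, H_1, ..., H_n).

    w[j] holds the innovation weight w_{j+1}, so len(w) == n. Each step
    shrinks the existing coefficients by (1 - w_j) and appends w_j as the
    coefficient of the newest innovation measure.
    """
    coeffs = [1.0]
    for wj in w:
        coeffs = [c * (1.0 - wj) for c in coeffs] + [wj]
    return coeffs
```

For instance, `mixture_coefficients([0.5, 0.25])` gives `[0.375, 0.375, 0.25]`: the coefficients always sum to one, and the contribution of $G_1$ decays geometrically with the innovation weights.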
11 Applications



Dependent nonparametric processes were originally proposed to flexibly model
the error distribution in a regression model, $y_i = f(x_i) + \epsilon_i$ [38].
Whereas traditionally the distribution of $\epsilon_i$ is Gaussian or of some
other parametric form, using a dependent stochastic process the $\epsilon_i$ can
be distributed according to an arbitrary distribution $F_{x_i}$ that may depend
on the observed covariate $x_i$.

Early research into dependent nonparametric processes focused on the use of
dependent Dirichlet processes (or Dirichlet-based processes) in regression and
density estimation. The survey by Dunson [16] provides a thorough overview of
covariate-dependent density estimation. Most of this early work originates from
the statistics literature.

The breadth of applications expanded rapidly once the machine learning 
community began to realize the potential of dependent nonparametric pro- 
cesses. In this section, we describe some recent machine learning applications 
of covariate-dependent nonparametric Bayesian models. 

11.1 Image processing 

A number of dependent stochastic processes have been applied to image pro- 
cessing applications, including denoising, inpainting (interpolation), and image 
segmentation. 

In the denoising problem a noisy version of the image is provided and the goal
is to uncover a smoothed version with the noise removed. In image inpainting
only a fraction of the pixels are actually observed and the goal is to impute
the values of missing pixels under the learned model. In both cases a common
pre-processing step is to break the image up into small, often overlapping
patches of dimension $m \times m$ ($m = 8$ is frequently used). Each patch is
then treated as a vector $y_i \in \mathbb{R}^{m^2}$.
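The patch pre-processing step can be sketched as follows for an image stored as a list of rows. The function name `extract_patches` and the `stride` parameter are assumptions for illustration; a stride smaller than the patch size produces overlapping patches.

```python
def extract_patches(image, m, stride=1):
    """Split a 2-D image into flattened m-by-m patches.

    image is a list of equal-length rows of pixel values. Each returned
    patch is a list of m * m values in row-major order.
    """
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(0, rows - m + 1, stride):
        for c in range(0, cols - m + 1, stride):
            patch = [image[r + i][c + j] for i in range(m) for j in range(m)]
            patches.append(patch)
    return patches
```

Neighbouring patches produced with a small stride share most of their pixels, which is the local structure that a dependent prior over patches is intended to capture.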

The most common Bayesian nonparametric model for these problems is a sparse
factor model $y_i = D w_i + \epsilon_i$, where the columns of
$D \in \mathbb{R}^{m^2 \times \infty}$ are the dictionary elements (or factors),
$w_i \in \mathbb{R}^{\infty}$ are the factor weights and
$\epsilon_i \sim N(0, \sigma_\epsilon)$. The weights are decomposed as
$w_i = z_i \odot s_i$, where $z_i$ is a binary vector and the entries of
$s_i \sim N(0, \sigma_s)$. In a stationary model, the entries of $z_i$ are
typically distributed according to an Indian buffet process, or equivalently
their latent probabilities are described using a beta process [75]. However,
this makes the poor assumption that image patches are exchangeable.
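A generative draw from a truncated version of the sparse factor model can be written directly from its definition. In the sketch below the dictionary is truncated to a finite number of columns, the feature probabilities `probs` are supplied by the caller (in a dependent model they would vary with the patch location), and the noise parameters are treated as standard deviations; all of these are assumptions made for illustration.

```python
import random


def sample_patch(D, probs, sigma_s, sigma_eps, rng):
    """Draw one observation y = D (z * s) + eps from a truncated model.

    D is a list of K dictionary columns, each a list of length P, and
    probs[k] is the probability that factor k is active.
    """
    K, P = len(D), len(D[0])
    z = [1 if rng.random() < p else 0 for p in probs]   # binary indicators
    s = [rng.gauss(0.0, sigma_s) for _ in range(K)]     # real-valued weights
    w = [zk * sk for zk, sk in zip(z, s)]               # sparse factor weights
    y = [sum(D[k][p] * w[k] for k in range(K)) + rng.gauss(0.0, sigma_eps)
         for p in range(P)]
    return y, z, w
```

Replacing the shared `probs` with patch-specific probabilities drawn from a dependent process such as the dHBP or KBP is what distinguishes the dependent models from the stationary one.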

A better approach is to use a dependent nonparametric prior to generate the
feature probabilities, since in addition to the expectation of global structure,
each patch overlaps with its neighbors, thus sharing local structure. This form
of structure can be achieved using the dHBP and KBP; using such models has
achieved superior denoising and more compact dictionaries [14, 15].

Another important and challenging problem in image processing is segmenting
images into coherent regions, such as "sky", "house", or "dog". This is
essentially a structured clustering task: each superpixel belongs to a single
cluster, but proximal superpixels are likely to have similar assignments. This
is exactly the sort of structure achieved by the latent stick-breaking processes
described in Section 7. The dependent Pitman-Yor process has been used to
achieve state-of-the-art segmentation of natural scenes [54].

11.2 Music segmentation 

A common task in analyzing music is determining segments of a song that are
highly correlated in time. One way of modeling the evolution of music is to use
a hidden Markov model (HMM) to model the Mel frequency cepstral coefficients
(MFCCs) of a piece of music across time. However, such a model does not allow
for evolution of the transition distribution across time. Better results can be
obtained by using a dynamic HDP [73] as the basis of the HMM.

Another approach is to use a sparse factor model [75] in a manner analogous
to the image processing problems above. A stationary model will miss local
correlations, suggesting the use of a dependent model. The KBP was able to
achieve better segmentation than both the stationary BPFA model [75] and the
dynamic HDP-HMM [75], both of which learned a blockier correlation matrix.

11.3 Topic modeling 

Topic modeling is a class of techniques for modeling documents as exchangeable
collections of words drawn from a document-specific distribution over a global
set of "topics", or distributions over words, for the purpose of decomposing a
collection of text documents into the underlying topics. The canonical topic
model is latent Dirichlet allocation [LDA, 77], a stationary model employing a
finite number of topics. The HDP has allowed the construction of nonparametric
topic models with an unbounded number of topics [29].

Many corpora evolve over time - for example news archives or journal vol- 
umes. Standard topic models such as LDA and topic models based on the HDP 
assume that documents are exchangeable, but in fact we may expect to see 
changes in the topic distribution over time, as topics wax and wane in proba- 
bility and the language used to describe a topic changes over time. 

A number of dependent Dirichlet processes have been used to create
time-dependent topic models. For example, the SNGP [68] has been used to
construct a dynamic HDP topic model. The marginal DP at time $t$ is used as the
base measure in an HDP that in turn models the topic distribution used by the
documents at time $t$. Since atoms are only active for a finite window, each
topic will appear for only a finite amount of time before disappearing. The
generalized Pólya urn model of [10] has been used in a similar setting; by using
the uniform deletion formulation of the model, topics are allowed to increase
and decrease in popularity multiple times.

In the Markov-DDP of [9], in addition to varying the topic probabilities, the
topics themselves, which are just (finite) probability distributions over a
fixed vocabulary, are allowed to evolve over time. The recurrent Chinese
restaurant process is used in a similar manner in the Timeline model of [78].

11.4 Financial applications 

Dependent nonparametric processes can also be applied to problems in finance.
The volatility of the price of a financial instrument is a measure of the
instrument's price variation. High volatility implies large changes in price and
vice versa. Modeling volatility is a challenging problem; see [6] for an
overview. A common model choice is a stochastic volatility model, a simple
version of which is to model the log-returns $r_t$, defined as
$\log P_t - \log P_{t-1}$ for $P_t$ the price at time $t$, as
$r_t \sim N(0, \sigma_t)$. The choice of the distribution for $\sigma_t$
determines the type of stochastic volatility model. The OUNRM has been used in
this setting [11], where we model $\sigma_t \sim G_t$ and where $G_t$ is drawn
from an OUNRM process with an inverse-gamma base measure to ensure a positive
variance. The results reported in [11] indicate that the time-dependent
nonparametric model of the volatility fits actual volatility well and that, by
explicitly modeling the time-dependence, the model is able to estimate the
length of the effects of shocks and other interesting phenomena.
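The log-return transformation and the simple stochastic volatility likelihood can be illustrated as follows. The sketch treats each $\sigma_t$ as a variance, consistent with the inverse-gamma base measure, and takes the sequence of variances as given rather than drawing it from an OUNRM; the function names are hypothetical.

```python
import math
import random


def log_returns(prices):
    """Compute r_t = log P_t - log P_{t-1} for a sequence of positive prices."""
    return [math.log(p1) - math.log(p0) for p0, p1 in zip(prices, prices[1:])]


def simulate_returns(variances, rng):
    """Draw r_t ~ N(0, sigma_t), treating each sigma_t as a variance."""
    return [rng.gauss(0.0, math.sqrt(v)) for v in variances]
```

In a full model the variances would themselves be latent draws from the time-varying measures $G_t$, so that periods of high and low volatility persist over time.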

12 Conclusion 

In this paper, we have attempted to provide a snapshot of the current range 
of dependent nonparametric processes employed by the machine learning and 
statistics communities. Our goal was to curate the wide range of papers on this 
topic into a few classes to highlight the main methods being used to induce 
dependency between random measures. We hope that, by highlighting the links 
between seemingly disparate models, the practitioner will find it easier to
navigate the available models or develop new models that are well matched to the
application at hand. We also hope that, by realizing and leveraging similarities 
between related models, the practitioner will find it easier to identify an efficient 
and appropriate inference technique for the model at hand. 

While the history of dependent stochastic processes is relatively short,
modern models have moved well beyond the single-p DDP introduced in [1]. Early
research in the field was driven by the statistics community, was mostly
focused on classical statistical problems such as density estimation and
regression, and adhered to the theoretical desiderata of MacEachern reproduced
in Section 3.

As the machine learning community has woken up to the potential of de- 
pendent nonparametric processes, the range of applications considered has bal- 
looned. In particular, the use of dependent nonparametric models has been 
active in areas such as text and image analysis. This expansion has also seen an
increase in more scalable inference algorithms based on variational or
sequential approaches.

However, the models introduced by the machine learning community have often
exhibited less theoretical rigor. Many of the processes developed in machine
learning do not meet all of the desiderata of MacEachern. For instance, many of
these processes do not have known marginals and some are not even stationary.

There is certainly an argument that the definition put forth by MacEachern 
is overly restrictive. While models such as the KSBP do not have marginals 
directly corresponding to known nonparametric priors, they are still useful and 
well defined priors. And while models such as the ddCRP and the dHBP rely 
on having a known set of covariate locations and may lack well-defined out-of- 
sample prediction methods, there are many applications - such as image denois- 
ing - where out-of-sample prediction is not an important task. 

The question arises of whether the framework of MacEachern is still appli- 
cable for modern dependent processes. Certainly large support of the prior, 
efficient inference and posterior consistency are still relevant. However, play- 
ing (relatively) fast and loose with the theoretical properties of models has 
often led to more manageable inference algorithms. We feel that the MacEach- 
ern framework is still a useful starting point, but believe that more emphasis 
should be placed on tractable inference, and on ensuring the dependence
between $\{G^{(x)} : x \in \mathcal{X}\}$ is appropriate for the data, rather
than using an overly restrictive form of dependence for the sake of theory.

Conversely, we hope to see a more rigorous analysis of the theoretical proper- 
ties of dependent nonparametric processes, particularly from the machine learn- 
ing community. We appeal to the statistics community to develop underlying 
theory for the processes from machine learning to provide confidence when us- 
ing them. The successful development of dependent nonparametric processes to 
their fullest potential depends on the complementary interests and expertise of 
the statistics and machine learning communities. 